22 research outputs found

Canonical, Stable, General Mapping using Context Schemes

Author: Haussler David
Novak Adam
Paten Benedict
Rosen Yohei
Publication venue: 'Oxford University Press (OUP)'
Publication date: 11/06/2015

Motivation: Sequence mapping is the cornerstone of modern genomics. However, most existing sequence mapping algorithms are insufficiently general. Results: We introduce context schemes: a method that allows the unambiguous recognition of a reference base in a query sequence by testing the query for substrings from an algorithmically defined set. Context schemes only map when there is a unique best mapping, and define this criterion uniformly for all reference bases. Mappings under context schemes can also be made stable, so that extension of the query string (e.g. by increasing read length) will not alter the mapping of previously mapped positions. Context schemes are general in several senses. They natively support the detection of arbitrary complex, novel rearrangements relative to the reference. They can scale over orders of magnitude in query sequence length. Finally, they are trivially extensible to more complex reference structures, such as graphs, that incorporate additional variation. We demonstrate empirically the existence of high performance context schemes, and present efficient context scheme mapping algorithms. Availability and Implementation: The software test framework created for this work is available from https://registry.hub.docker.com/u/adamnovak/sequence-graphs/. Contact: [email protected] Supplementary Information: Six supplementary figures and one supplementary section are available with the online version of this article. Comment: Submission for Bioinformatics

arXiv.org e-Print Archive

PubMed Central

eScholarship - University of California

An Average-Case Sublinear Exact Li and Stephens Forward Algorithm

Author: Paten Benedict J.
Rosen Yohei M.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)
Publication date: 01/01/2018

Hidden Markov models of haplotype inheritance such as the Li and Stephens model allow for computationally tractable probability calculations using the forward algorithm as long as the representative reference panel used in the model is sufficiently small. Specifically, the monoploid Li and Stephens model and its variants are linear in reference panel size unless heuristic approximations are used. However, sequencing projects numbering in the thousands to hundreds of thousands of individuals are underway, and others numbering in the millions are anticipated. To make the Li and Stephens forward algorithm for these datasets computationally tractable, we have created a numerically exact version of the algorithm with observed average case O(nk^{0.35}) runtime in number of genetic sites n and reference panel size k. This avoids any tradeoff between runtime and model complexity. We demonstrate that our approach also provides a succinct data structure for general purpose haplotype data storage. We discuss generalizations of our algorithmic techniques to other hidden Markov models.
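
For context, the classical forward recursion that is linear in reference panel size can be sketched as follows. This is a minimal illustration of the baseline O(nk) computation, not the authors' sublinear algorithm. The function name `li_stephens_forward`, the constant per-site recombination probability `rho`, and the mismatch probability `mu` are assumptions chosen for the example and do not come from the abstract.

```python
from typing import Sequence


def li_stephens_forward(
    panel: Sequence[Sequence[int]],
    query: Sequence[int],
    rho: float,
    mu: float,
) -> float:
    """Likelihood of `query` under a simple haploid copying model.

    Assumed parametrization (illustrative only):
      - panel: k reference haplotypes, each a sequence of n alleles
      - rho:   probability of a recombination event between adjacent sites,
               after which the copied haplotype is redrawn uniformly
      - mu:    probability that the query allele differs from the copied allele
    Runs in O(n * k) time by reusing the column sum at every site.
    """
    k = len(panel)
    n = len(query)
    if k == 0 or n == 0:
        raise ValueError("panel and query must be non-empty")
    if any(len(h) != n for h in panel):
        raise ValueError("every panel haplotype must have the query's length")

    def emission(j: int, i: int) -> float:
        # Match with probability 1 - mu, mismatch with probability mu.
        return 1.0 - mu if panel[j][i] == query[i] else mu

    # Initial column: uniform prior over which haplotype is being copied.
    forward = [emission(j, 0) / k for j in range(k)]

    for i in range(1, n):
        total = sum(forward)  # shared by all k updates at this site
        forward = [
            emission(j, i) * ((1.0 - rho) * forward[j] + rho * total / k)
            for j in range(k)
        ]

    return sum(forward)
```

Each site costs O(k) work because the recombination term depends only on the column sum, which is why the total cost grows linearly with the panel size k in this baseline.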

Dagstuhl Research Online Publication Server

eScholarship - University of California

Tools for large and detailed experiments in genomics and tissue development

Author: Rosen Yohei Maurice
Publication venue: eScholarship, University of California
Publication date: 01/01/2022

In this dissertation I present algorithmic and data representation advances in genomics as well as tools for a new bioinformatic approach to mammalian cell culture experiments which I call highly instrumented cell culture. The first section deals with fast variants of the forward algorithm for the Li and Stephens copying model of haplotypes derived from a population. I introduce a direct optimization of the Li and Stephens model forward algorithm which performs the identical calculation, without any approximations, but achieves this in average case sublinear time. This is an improvement over the classical algorithm which is at best linear time. I achieve this by using a sparse representation of the population haplotypes and by introducing an efficient lazy evaluation scheme. I also introduce a generalization of the recombination modeling component of the Li and Stephens model which operates on haplotypes and populations encoded in variation graphs. The second section deals with algebraic representations of genetic sites in variation graphs. I introduce the concept of the bundle, a motif in bidirected graphs which leads to a well defined concept of adjacency of sets of nodes. This allows a granular decomposition of the graph into sites which extends prior work on ultrabubbles and snarls previously reported by Paten et al. Lastly, I introduce the concept of highly instrumented cell culture and some technologies to enable it. I demonstrate a low-cost, robust, arbitrarily scalable microscope array for simultaneous parallel continuous time-series microscopy. I demonstrate new approaches to rapid prototyping of labware and fluidic actuators. I also demonstrate principles and implementation of incubator-free cell culture, which is my approach to cell culture in media containing carbonic acid-carbonate-bicarbonate buffer systems without using any carbon dioxide rich gas chamber. 
Finally, I describe how these technologies integrate to enable the creation of highly instrumented, automated, data-rich biology experiments.

eScholarship - University of California

Modelling haplotypes with respect to reference cohort variation graphs

Author: Eizenga Jordan
Paten Benedict
Rosen Yohei
Publication venue: eScholarship, University of California
Publication date: 15/07/2017

Motivation: Current statistical models of haplotypes are limited to panels of haplotypes whose genetic variation can be represented by arrays of values at linearly ordered bi- or multiallelic loci. These methods cannot model structural variants or variants that nest or overlap. Results: A variation graph is a mathematical structure that can encode arbitrarily complex genetic variation. We present the first haplotype model that operates on a variation graph-embedded population reference cohort. We describe an algorithm to calculate the likelihood that a haplotype arose from this cohort through recombinations and demonstrate time complexity linear in haplotype length and sublinear in population size. We furthermore demonstrate a method of rapidly calculating likelihoods for related haplotypes. We describe mathematical extensions to allow modelling of mutations. This work is an important incremental step for clinical genomics and genetic epidemiology since it is the first haplotype model which can represent all sorts of variation in the population. Availability and implementation: Available on GitHub at https://github.com/yoheirosen/vg. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

eScholarship - University of California

An average-case sublinear forward algorithm for the haploid Li and Stephens model

Author: Benedict J. Paten
Yohei M. Rosen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/04/2019

Background: Hidden Markov models of haplotype inheritance such as the Li and Stephens model allow for computationally tractable probability calculations using the forward algorithm as long as the representative reference panel used in the model is sufficiently small. Specifically, the monoploid Li and Stephens model and its variants are linear in reference panel size unless heuristic approximations are used. However, sequencing projects numbering in the thousands to hundreds of thousands of individuals are underway, and others numbering in the millions are anticipated. Results: To make the forward algorithm for the haploid Li and Stephens model computationally tractable for these datasets, we have created a numerically exact version of the algorithm with observed average case sublinear runtime with respect to reference panel size k when tested against the 1000 Genomes dataset. Conclusions: We show a forward algorithm which avoids any tradeoff between runtime and model complexity. Our algorithm makes use of two general strategies which might be applicable to improving the time complexity of other future sequence analysis algorithms: sparse dynamic programming matrices and lazy evaluation.

Directory of Open Access Journals

An Average-Case Sublinear Exact Li and Stephens Forward Algorithm.

Author: Paten Benedict J
Rosen Yohei M
Publication venue: eScholarship, University of California
Publication date: 01/01/2018

eScholarship - University of California

Superbubbles, Ultrabubbles, and Cacti

Author: Eizenga Jordan M
Garrison Erik
Hickey Glenn
Novak Adam M
Paten Benedict
Rosen Yohei M
Publication venue: eScholarship, University of California
Publication date: 01/07/2018

A superbubble is a type of directed acyclic subgraph with single distinct source and sink vertices. In genome assembly and genetics, the possible paths through a superbubble can be considered to represent the set of possible sequences at a location in a genome. Bidirected and biedged graphs are a generalization of digraphs that are increasingly being used to more fully represent genome assembly and variation problems. In this study, we define snarls and ultrabubbles, generalizations of superbubbles for bidirected and biedged graphs, and give an efficient algorithm for the detection of these more general structures. Key to this algorithm is the cactus graph, which, we show, encodes the nested decomposition of a graph into snarls and ultrabubbles within its structure. We propose and demonstrate empirically that this decomposition on bidirected and biedged graphs solves a fundamental problem by defining genetic sites for any collection of genomic variations, including complex structural variations, without need for any single reference genome coordinate system. Further, the nesting of the decomposition gives a natural way to describe and model variations contained within large variations, a case not currently dealt with by existing formats [e.g., variant call format (VCF)].

eScholarship - University of California

Canonical, stable, general mapping using context schemes

Author: Adam M. Novak
Benedict Paten
David Haussler
Yohei Rosen
Publication venue: 'Oxford University Press (OUP)'

Crossref

Superbubbles, Ultrabubbles, and Cacti

Author: Adam M. Novak
Benedict Paten
Erik Garrison
Glenn Hickey
Jordan M. Eizenga
Yohei M. Rosen
Publication venue: 'Mary Ann Liebert Inc'

Crossref

Picroscope: low-cost system for simultaneous longitudinal biological imaging.

Author: Baudin Pierre V
Cordero Sergio A
Haussler David
Jung Erik A
Ly Victoria T
Mantalas Gary L
Mostajo-Radji Mohammed A
Nowakowski Tomasz J
Pansodtee Pattawong
Pollen Alex A
Rolandi Marco
Rosen Yohei M
Ross Jayden M
Salama Sofie R
Seiler Spencer T
Selberg John A
Teodorescu Mircea
Voitiuk Kateryna
Willsey Helen Rankin
Publication venue: eScholarship, University of California
Publication date: 01/11/2021

Simultaneous longitudinal imaging across multiple conditions and replicates has been crucial for scientific studies aiming to understand biological processes and disease. Yet, imaging systems capable of accomplishing these tasks are economically unattainable for most academic and teaching laboratories around the world. Here, we propose the Picroscope, which is the first low-cost system for simultaneous longitudinal biological imaging made primarily using off-the-shelf and 3D-printed materials. The Picroscope is compatible with standard 24-well cell culture plates and captures 3D z-stack image data. The Picroscope can be controlled remotely, allowing for automatic imaging with minimal intervention from the investigator. Here, we use this system in a range of applications. We gathered longitudinal whole organism image data for frogs, zebrafish, and planaria worms. We also gathered image data inside an incubator to observe 2D monolayers and 3D mammalian tissue culture models. Using this tool, we can measure the behavior of entire organisms or individual cells over long time periods.

Directory of Open Access Journals

eScholarship - University of California